# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from linearmodels import PanelOLS
L18: Robust panel regression
Lecture overview
Remember from last lecture that we are using the following empirical question to showcase the statistical tools needed to run robust regression analysis:
Which of the following firm characteristics (if any) have statistically significant predictive power over firms’ profitability: the firm’s cash holdings, its book leverage or its capital investments?
In the previous lecture, we collected the data we needed for this analysis, produced some summary statistics and ran a basic linear regression where firm future profitability is the dependent variable, and firm cash holdings, book leverage, and investment are the explanatory variables. In this lecture, we continue this analysis by tackling two very common issues with linear regression analysis:
- The potential presence of “fixed-effects” in the data
- The issue of correlated error terms in the regression
The statsmodels package we used for the introductory regression materials does not implement some of the tools we will discuss in this lecture. So in these lecture notes, we will be using the linearmodels package, which can be installed by typing:
pip install linearmodels
in a Terminal or Anaconda Prompt. Once you install the package, import the PanelOLS class from it, as in the import statements at the top of these notes.
Preliminaries
# Load data from last time
raw = pd.read_pickle('../data/comp_clean.zip')
raw.dtypes
permno float64
datadate object
ib float64
at float64
che float64
dltt float64
ppent float64
sich float64
year int64
roa float64
future_roa float64
cash float64
leverage float64
investment float64
w_future_roa float64
w_cash float64
w_leverage float64
w_investment float64
const int64
dtype: object
# Make lists of variable names for convenience
yvar = 'w_future_roa'
xvars = ['const','w_cash', 'w_leverage','w_investment']
main_vars = [yvar] + xvars
# Keep only the data we need and set the index
comp = raw[['permno','year','sich'] + main_vars].copy()
comp['const'] = 1
comp = comp.set_index(['permno','year'])
comp
permno | year | sich | w_future_roa | const | w_cash | w_leverage | w_investment |
---|---|---|---|---|---|---|---|
10000.0 | 1986 | NaN | NaN | 1 | 0.164539 | 0.027423 | NaN |
10001.0 | 1986 | NaN | 0.026506 | 1 | 0.060938 | 0.240647 | NaN |
10001.0 | 1987 | 4924.0 | 0.046187 | 1 | 0.061932 | 0.233625 | -0.000170 |
10001.0 | 1988 | 4924.0 | 0.065069 | 1 | 0.063400 | 0.217725 | -0.019599 |
10001.0 | 1989 | 4924.0 | 0.059901 | 1 | 0.063399 | 0.396984 | 0.332669 |
... | ... | ... | ... | ... | ... | ... | ... |
93436.0 | 2016 | 3711.0 | -0.068448 | 1 | 0.154374 | 0.267113 | 0.361891 |
93436.0 | 2017 | 3711.0 | -0.032821 | 1 | 0.122952 | 0.331046 | 0.190355 |
93436.0 | 2018 | 3711.0 | -0.025125 | 1 | 0.130404 | 0.317894 | -0.026913 |
93436.0 | 2019 | 3711.0 | 0.013826 | 1 | 0.189863 | 0.368038 | 0.014800 |
93436.0 | 2020 | 3711.0 | NaN | 1 | 0.376275 | 0.208790 | 0.060904 |
237017 rows × 6 columns
Endogeneity
We say that your regression may suffer from an endogeneity problem (or an endogeneity bias) if you suspect that the mean independence assumption (see assumption A2 in the regression intro lecture) is not satisfied, i.e. if you think that:
\[E[\epsilon_t | X] \neq0\]
There are many reasons why this issue might arise (look up “omitted variable bias”, “reverse causality bias”, and “measurement error bias” if you are interested in a deeper analysis). We will not go into each of these possible sources of endogeneity. Here, we only describe two common ways to address endogeneity issues, and we implement only the latter.
- Instrumental Variables (IV) estimation
- The main idea behind this approach is to find, for every endogenous variable X, another variable Z (called an “instrument”) which is correlated with X (aka the “relevance” condition), but does not affect the dependent variable in any way other than through its relation with X (aka the “validity” condition). The Z instrument is then used to extract the exogenous variation in X, which in turn is used in our main regression instead of X.
- This is a very general approach (it can be used regardless of what is causing the endogeneity issue), but it’s a bit too advanced to cover in this course. I will simply mention that the linearmodels package we use in this lecture can also run IV estimation through its IV2SLS class, and I’ll leave this for you to study at your own pace.
- Fixed effects estimation
- This approach deals with the situation in which the endogeneity problem is caused by some unobservable, omitted variable, that is constant either in the cross section or over time
- Example 1: firm fixed effects
- It may be possible that the firm’s ROA is also determined by management quality (which we cannot measure easily). If high-quality managers, say, also like to hold a lot of cash, then the cash holdings variable is endogenous (in the equation above, cash holdings is part of X and management quality would be part of \(\epsilon\), since it affects ROA but is not part of our explanatory variables X). However, if management quality is relatively constant over time, we can control for its effects on ROA by demeaning the data at the firm level. This is what a firm fixed effects estimator does.
- Example 2: time fixed effects
- It may be the case that, in any given year, some macroeconomic shock affects both ROAs and cash holdings for all firms (e.g. a recession will decrease ROAs and increase cash holdings almost across the board). If this is the case, then cash holdings is again endogenous. But if the macroeconomic shock is (approximately) the same for all firms, then we can control for its effects (and fix our endogeneity problem) by demeaning the data at the year level. This is what a time fixed effects estimator does.
- Below, we show how to control for both firm and year fixed effects in our example application.
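To make the firm fixed effects idea concrete, here is a minimal sketch on simulated data (not the lecture's Compustat sample). All names and numbers (`quality`, `true_beta`, the 0.8 loading, etc.) are assumptions chosen for illustration. An unobserved, firm-constant "management quality" drives both cash and ROA, so the pooled slope is biased, while demeaning at the firm level should recover a slope close to the true one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_firms, n_years, true_beta = 500, 10, 0.5   # hypothetical simulation settings
n_obs = n_firms * n_years

firm = np.repeat(np.arange(n_firms), n_years)
# Unobserved management quality: constant within each firm
quality = np.repeat(rng.normal(size=n_firms), n_years)
# Cash is correlated with quality, which makes it endogenous if quality is omitted
cash = 0.8 * quality + rng.normal(size=n_obs)
roa = true_beta * cash + quality + rng.normal(size=n_obs)

sim = pd.DataFrame({'firm': firm, 'cash': cash, 'roa': roa})

def ols_slope(x, y):
    """Slope from a regression of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Pooled regression: ignores the omitted firm-level variable
beta_pooled = ols_slope(sim['cash'].to_numpy(), sim['roa'].to_numpy())

# Firm fixed effects: subtract each firm's own mean from every variable
demeaned = sim[['cash', 'roa']] - sim.groupby('firm')[['cash', 'roa']].transform('mean')
beta_within = ols_slope(demeaned['cash'].to_numpy(), demeaned['roa'].to_numpy())

print(f"true: {true_beta}, pooled: {beta_pooled:.3f}, firm-demeaned: {beta_within:.3f}")
```

Replacing `groupby('firm')` with a grouping by year would give the analogous sketch for time fixed effects.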
We will estimate fixed-effects regressions using the PanelOLS function that we imported above:
Abbreviated syntax:
PanelOLS(dependent, exog, entity_effects=False, time_effects=False, other_effects=None)
The first two arguments are where you tell the function what to use for the dependent variable and the independent variables, respectively. For firm fixed effects, you set entity_effects = True; for time fixed effects, you set time_effects = True; and for fixed effects at any other level (e.g. industry), you pass to other_effects the variable that determines which observation is in which group (e.g. an industry identifier for industry fixed effects). For entity_effects and time_effects, PanelOLS assumes that the first level of the index contains the firm identifier and the second level contains the time identifier (which is why we used set_index(['permno','year']) above).
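As a quick illustration of the index layout described above, the sketch below builds a tiny made-up panel (the `toy` values are hypothetical, not taken from our data) and checks that the firm identifier sits in the first index level and the year in the second. A check like this can be run on any dataframe before passing it to PanelOLS.

```python
import pandas as pd

# Hypothetical toy panel: two firms observed in two years
toy = pd.DataFrame({
    'permno': [10001, 10001, 10002, 10002],
    'year':   [2019, 2020, 2019, 2020],
    'w_cash': [0.10, 0.12, 0.30, 0.25],
})

# Same indexing step as in the lecture: entity first, time second
panel = toy.set_index(['permno', 'year'])

entity_ids = panel.index.get_level_values(0)  # firm identifier
time_ids = panel.index.get_level_values(1)    # time identifier

print(panel.index.nlevels, list(panel.index.names))
print(entity_ids.nunique(), "firms;", time_ids.nunique(), "years")
```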
# Run basic regression again, for comparison (ignore warning about missing values if you get one)
results = PanelOLS(dependent = comp[yvar],
                   exog = comp[xvars],
                  ).fit()
print(results.summary)
C:\Users\ionmi\anaconda3\lib\site-packages\linearmodels\panel\data.py:98: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if is_categorical(s):
C:\Users\ionmi\anaconda3\lib\site-packages\linearmodels\utility.py:549: MissingValueWarning:
Inputs contain missing values. Dropping rows with missing observations.
warnings.warn(missing_value_warning_msg, MissingValueWarning)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: w_future_roa R-squared: 0.0955
Estimator: PanelOLS R-squared (Between): 0.1031
No. Observations: 185315 R-squared (Within): -0.0642
Date: Fri, Feb 25 2022 R-squared (Overall): 0.0955
Time: 14:36:53 Log-likelihood 5498.7
Cov. Estimator: Unadjusted
F-statistic: 6524.2
Entities: 19139 P-value 0.0000
Avg Obs: 9.6826 Distribution: F(3,185311)
Min Obs: 1.0000
Max Obs: 80.000 F-statistic (robust): 6524.2
P-value 0.0000
Time periods: 41 Distribution: F(3,185311)
Avg Obs: 4519.9
Min Obs: 4.0000
Max Obs: 6276.0
Parameter Estimates
================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------
const 0.0314 0.0010 32.312 0.0000 0.0295 0.0334
w_cash -0.3712 0.0028 -134.43 0.0000 -0.3766 -0.3658
w_leverage -0.0727 0.0030 -23.971 0.0000 -0.0787 -0.0668
w_investment 0.1634 0.0064 25.736 0.0000 0.1510 0.1759
================================================================================
You can check that the above results are identical to the ones we obtained in the last lecture using statsmodels.api.OLS. The PanelOLS function also tells us that we have 19,139 different entities (firms) in our sample, and 41 different time periods (years).
Firm fixed effects
results_firmfe = PanelOLS(dependent = comp[yvar],
                          exog = comp[xvars],
                          entity_effects = True
                         ).fit()
print(results_firmfe.summary)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: w_future_roa R-squared: 0.0035
Estimator: PanelOLS R-squared (Between): -0.0776
No. Observations: 185315 R-squared (Within): 0.0035
Date: Fri, Feb 25 2022 R-squared (Overall): -0.0129
Time: 14:36:53 Log-likelihood 8.094e+04
Cov. Estimator: Unadjusted
F-statistic: 196.81
Entities: 19139 P-value 0.0000
Avg Obs: 9.6826 Distribution: F(3,166173)
Min Obs: 1.0000
Max Obs: 80.000 F-statistic (robust): 196.81
P-value 0.0000
Time periods: 41 Distribution: F(3,166173)
Avg Obs: 4519.9
Min Obs: 4.0000
Max Obs: 6276.0
Parameter Estimates
================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------
const -0.0357 0.0010 -34.112 0.0000 -0.0377 -0.0336
w_cash 0.0208 0.0038 5.4863 0.0000 0.0134 0.0283
w_leverage -0.0557 0.0036 -15.452 0.0000 -0.0628 -0.0487
w_investment 0.0862 0.0050 17.261 0.0000 0.0764 0.0960
================================================================================
F-test for Poolability: 10.917
P-value: 0.0000
Distribution: F(19138,166173)
Included effects: Entity
The P-value under F-test for Poolability is very low, which tells us that the firm fixed effects are jointly statistically significant in our regression (i.e. we should keep them in our regression).
Note how the coefficients have changed now that we have included firm fixed effects in our regression. In particular, note that the coefficient on w_cash has changed sign (from -0.3712 to 0.0208).
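For intuition on where the poolability statistic comes from, the block below writes out the textbook F-test that compares the pooled regression with the fixed effects regression. This is a sketch under the assumption that the reported statistic follows this standard form. The symbols are introduced here for illustration: \(RSS\) is a residual sum of squares, \(n\) the number of observations, \(N\) the number of firms and \(K\) the number of slope coefficients.

```latex
\[
F = \frac{(RSS_{pooled} - RSS_{FE}) / (N - 1)}{RSS_{FE} / (n - N - K)}
\]
```

The degrees of freedom implied by this formula agree with the output above: \(N - 1 = 19139 - 1 = 19138\) and \(n - N - K = 185315 - 19139 - 3 = 166173\), which is the reported F(19138,166173) distribution.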
Time fixed effects
results_timefe = PanelOLS(dependent = comp[yvar],
                          exog = comp[xvars],
                          time_effects = True
                         ).fit()
print(results_timefe.summary)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: w_future_roa R-squared: 0.0944
Estimator: PanelOLS R-squared (Between): 0.1033
No. Observations: 185315 R-squared (Within): -0.0645
Date: Fri, Feb 25 2022 R-squared (Overall): 0.0955
Time: 14:36:54 Log-likelihood 6386.9
Cov. Estimator: Unadjusted
F-statistic: 6436.1
Entities: 19139 P-value 0.0000
Avg Obs: 9.6826 Distribution: F(3,185271)
Min Obs: 1.0000
Max Obs: 80.000 F-statistic (robust): 6436.1
P-value 0.0000
Time periods: 41 Distribution: F(3,185271)
Avg Obs: 4519.9
Min Obs: 4.0000
Max Obs: 6276.0
Parameter Estimates
================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------
const 0.0308 0.0010 31.563 0.0000 0.0289 0.0327
w_cash -0.3710 0.0028 -133.37 0.0000 -0.3765 -0.3656
w_leverage -0.0694 0.0030 -22.855 0.0000 -0.0754 -0.0634
w_investment 0.1677 0.0064 26.382 0.0000 0.1553 0.1802
================================================================================
F-test for Poolability: 44.613
P-value: 0.0000
Distribution: F(40,185271)
Included effects: Time
Once again, the P-value for the F-test for Poolability is very small, which means we should also keep the time fixed effects in our regression. Combined with the previous result, this means we should be including both firm and time fixed effects, which is what we do below.
Note also that, with time fixed effects only, the coefficient on w_cash is negative again (-0.3710), almost identical to the estimate from the basic regression (-0.3712).
Both firm and time fixed effects:
results_bothfe = PanelOLS(dependent = comp[yvar],
                          exog = comp[xvars],
                          entity_effects = True, time_effects = True,
                         ).fit()
print(results_bothfe.summary)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: w_future_roa R-squared: 0.0029
Estimator: PanelOLS R-squared (Between): -0.0734
No. Observations: 185315 R-squared (Within): 0.0035
Date: Fri, Feb 25 2022 R-squared (Overall): -0.0097
Time: 14:36:55 Log-likelihood 8.184e+04
Cov. Estimator: Unadjusted
F-statistic: 160.89
Entities: 19139 P-value 0.0000
Avg Obs: 9.6826 Distribution: F(3,166133)
Min Obs: 1.0000
Max Obs: 80.000 F-statistic (robust): 160.89
P-value 0.0000
Time periods: 41 Distribution: F(3,166133)
Avg Obs: 4519.9
Min Obs: 4.0000
Max Obs: 6276.0
Parameter Estimates
================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------
const -0.0358 0.0010 -34.326 0.0000 -0.0379 -0.0338
w_cash 0.0162 0.0038 4.2602 0.0000 0.0087 0.0236
w_leverage -0.0499 0.0036 -13.775 0.0000 -0.0570 -0.0428
w_investment 0.0813 0.0050 16.210 0.0000 0.0715 0.0912
================================================================================
F-test for Poolability: 11.083
P-value: 0.0000
Distribution: F(19178,166133)
Included effects: Entity, Time
In this final specification, it seems like cash holdings are positively associated with future profitability.
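To see what including both sets of fixed effects does mechanically, here is a minimal sketch on a simulated balanced panel (all names and parameter values are assumptions for illustration). It compares two routes to the same slope: demeaning by firm and by year (adding back the overall mean), and a regression with explicit firm and year dummy variables. On a balanced panel the two should agree up to rounding error. With an unbalanced panel like ours, this simple demeaning formula is no longer exact, which is one reason to rely on a library routine.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_firms, n_years = 50, 8                      # hypothetical balanced panel
firm = np.repeat(np.arange(n_firms), n_years)
year = np.tile(np.arange(n_years), n_firms)

firm_effect = rng.normal(size=n_firms)[firm]  # constant over time within a firm
year_effect = rng.normal(size=n_years)[year]  # common to all firms within a year
x = 0.5 * firm_effect + 0.5 * year_effect + rng.normal(size=firm.size)
y = 0.3 * x + firm_effect + year_effect + rng.normal(size=firm.size)
sim = pd.DataFrame({'firm': firm, 'year': year, 'x': x, 'y': y})

def two_way_demean(df, col):
    """Subtract firm means and year means, then add back the overall mean."""
    return (df[col]
            - df.groupby('firm')[col].transform('mean')
            - df.groupby('year')[col].transform('mean')
            + df[col].mean())

# Route 1: regression on two-way demeaned data
x_dm = two_way_demean(sim, 'x').to_numpy()
y_dm = two_way_demean(sim, 'y').to_numpy()
beta_within = (x_dm @ y_dm) / (x_dm @ x_dm)

# Route 2: regression with explicit firm and year dummies
dummies = pd.get_dummies(sim[['firm', 'year']].astype(str), drop_first=True)
X = np.column_stack([np.ones(len(sim)), sim['x'].to_numpy(),
                     dummies.to_numpy(dtype=float)])
beta_lsdv = np.linalg.lstsq(X, sim['y'].to_numpy(), rcond=None)[0][1]

print(f"two-way demeaning: {beta_within:.6f}, dummy variables: {beta_lsdv:.6f}")
```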
Sector fixed effects
comp['sic2d'] = comp['sich'].astype('string').str[0:2]
comp['sic2d'].value_counts()
73 18965
28 16678
36 13361
60 11046
38 10881
...
76 86
81 34
86 11
90 11
89 8
Name: sic2d, Length: 69, dtype: Int64
Note that ‘sic2d’ contains missing values, which will give us an error if we try to use them as fixed-effects. So we get rid of all missing values in our regression data, and store this in a new dataframe first:
df = comp[main_vars + ['sic2d']].dropna()
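The short sketch below shows what the string-slicing step does to a few example values, including a missing one. The values 4924.0 and 3711.0 appear in our data preview; the value 100.0 is a hypothetical code added to illustrate one thing that may be worth checking: if a SIC code below 1000 is stored as a number, the first two characters of its string form are not its two-digit sector. The zero-padded variant at the end is one possible alternative, not the approach used in this lecture.

```python
import numpy as np
import pandas as pd

# Example SIC codes stored as floats (100.0 is a hypothetical value)
sich = pd.Series([4924.0, np.nan, 3711.0, 100.0])

# Same transformation as in the lecture: first two characters of the string form
sic2d = sich.astype('string').str[0:2]
print(sic2d.tolist())          # missing codes stay missing

# Possible alternative: convert to integers and zero-pad to four digits first
sic2d_padded = (sich.astype('Int64')
                    .astype('string')
                    .str.zfill(4)
                    .str[0:2])
print(sic2d_padded.tolist())
```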
Now we can run our industry fixed-effects regression:
results_indfe = PanelOLS(dependent = df[yvar],
                         exog = df[xvars],
                         other_effects = df['sic2d']
                        ).fit()
print(results_indfe.summary)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: w_future_roa R-squared: 0.0571
Estimator: PanelOLS R-squared (Between): 0.0906
No. Observations: 153676 R-squared (Within): -0.0440
Date: Fri, Feb 25 2022 R-squared (Overall): 0.1012
Time: 14:36:56 Log-likelihood 1745.7
Cov. Estimator: Unadjusted
F-statistic: 3102.9
Entities: 23812 P-value 0.0000
Avg Obs: 6.4537 Distribution: F(3,153604)
Min Obs: 0.0000
Max Obs: 67.000 F-statistic (robust): 3102.9
P-value 0.0000
Time periods: 41 Distribution: F(3,153604)
Avg Obs: 3748.2
Min Obs: 0.0000
Max Obs: 5811.0
Parameter Estimates
================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------
const 0.0158 0.0012 13.445 0.0000 0.0135 0.0181
w_cash -0.3090 0.0034 -91.633 0.0000 -0.3156 -0.3024
w_leverage -0.0573 0.0036 -15.832 0.0000 -0.0644 -0.0502
w_investment 0.1727 0.0074 23.452 0.0000 0.1583 0.1871
================================================================================
F-test for Poolability: 85.603
P-value: 0.0000
Distribution: F(68,153604)
Included effects: Other Effect (sic2d)
Model includes 5 other effects
Other Effect Observations per group (sic2d):
Avg Obs: 2227.2, Min Obs: 4.0000, Max Obs: 1.44e+04, Groups: 69